Building and Explaining a Classifier using AutoMLx

by the Oracle AutoMLx Team


AutoMLx Classification Demo version 23.1.1.

Copyright (c) 2023 Oracle, Inc.

Licensed under the Universal Permissive License v 1.0 as shown at https://oss.oracle.com/licenses/upl/

Overview of this Notebook¶

In this notebook we will build a classifier using the Oracle AutoMLx tool for the public Census Income dataset. The dataset is a binary classification dataset, and more details about the dataset can be found at https://archive.ics.uci.edu/ml/datasets/Adult. We explore the various options provided by the Oracle AutoMLx tool, allowing the user to exercise control over the AutoML training process. We then evaluate the different models trained by AutoML. Finally we provide an overview of the possibilites that Oracle AutoMLx offers for explaining the predictions of the tuned model.


Prerequisites¶

  • Experience level: Novice (Python and Machine Learning)
  • Professional experience: Some industry experience

Business Use¶

Data analytics and modeling problems using Machine Learning (ML) are becoming popular and often rely on data science expertise to build accurate ML models. Such modeling tasks primarily involve the following steps:

  • Preprocess dataset (clean, impute, engineer features, normalize).
  • Pick an appropriate model for the given dataset and prediction task at hand.
  • Tune the chosen model’s hyperparameters for the given dataset.

All of these steps are significantly time consuming and heavily rely on data scientist expertise. Unfortunately, to make this problem harder, the best feature subset, model, and hyperparameter choice widely varies with the dataset and the prediction task. Hence, there is no one-size-fits-all solution to achieve reasonably good model performance. Using a simple Python API, AutoML can quickly (faster) jump-start the datascience process with an accurately-tuned model and appropriate features for a given prediction task.

Table of Contents¶

  • 0. Setup
  • 1. Load the Census Income dataset
  • 2. AutoML
    • 2.0. Set the engine
    • 2.1. Create an Instance of Oracle AutoMLx
    • 2.2. Train a Model using AutoML
    • 2.3. Analyze the AutoML optimization process
      • 2.3.1. Algorithm Selection
      • 2.3.2. Adaptive Sampling
      • 2.3.3. Feature Selection
      • 2.3.4. Hyperparameter Tuning
      • 2.3.5. Confusion Matrix
    • 2.3. Analyze the AutoML optimization process
      • 2.4. Provide a Specific Model List to AutoML
      • 2.5. Increase the number of tuned models
      • 2.6. Specify a Different Scoring Metric to AutoML
      • 2.7. Specify a User-defined Scoring Function to AutoML
      • 2.8. Specify a time budget to AutoML
      • 2.9. Specify a minimum set of features to AutoML
  • 3. Machine Learning Explainability (MLX)
    • 3.1. Initialize an MLExplainer
    • 3.2. Model Explanations (Global Feature Importance)
    • 3.3. Feature Dependence Explanations (PDP + ICE)
    • 3.4. Prediction Explanations (Local Feature Importance)
      • 3.4.1. Aggregate Local Feature Importance
    • 3.5. Interactive What-If Explainers
    • 3.6. Advanced Feature Importance Options
      • 3.6.1. Change the number of iterations
      • 3.6.2. Include the effects of feature interactions (with Shapley feature importance)
      • 3.6.3. Explain the model or explain the world
      • 3.6.4. Local explanations with surrogate models (LIME+)
    • 3.7. Advanced Feature Dependence Options (ALE)
  • References

Setup¶

Basic setup for the Notebook.

In [1]:
! pip install seaborn==0.12.1

%matplotlib inline
%load_ext autoreload
%autoreload 2
Requirement already satisfied: seaborn==0.12.1 in /scratch_user/ypushak/automl-3/automl/py_3.8.7/lib/python3.8/site-packages (0.12.1)
Requirement already satisfied: numpy>=1.17 in /scratch_user/ypushak/automl-3/automl/py_3.8.7/lib/python3.8/site-packages (from seaborn==0.12.1) (1.22.2)
Requirement already satisfied: matplotlib!=3.6.1,>=3.1 in /scratch_user/ypushak/automl-3/automl/py_3.8.7/lib/python3.8/site-packages (from seaborn==0.12.1) (3.5.1)
Requirement already satisfied: pandas>=0.25 in /scratch_user/ypushak/automl-3/automl/py_3.8.7/lib/python3.8/site-packages (from seaborn==0.12.1) (1.4.1)
Requirement already satisfied: cycler>=0.10 in /scratch_user/ypushak/automl-3/automl/py_3.8.7/lib/python3.8/site-packages (from matplotlib!=3.6.1,>=3.1->seaborn==0.12.1) (0.10.0)
Requirement already satisfied: kiwisolver>=1.0.1 in /scratch_user/ypushak/automl-3/automl/py_3.8.7/lib/python3.8/site-packages (from matplotlib!=3.6.1,>=3.1->seaborn==0.12.1) (1.1.0)
Requirement already satisfied: pyparsing>=2.2.1 in /scratch_user/ypushak/automl-3/automl/py_3.8.7/lib/python3.8/site-packages (from matplotlib!=3.6.1,>=3.1->seaborn==0.12.1) (2.4.7)
Requirement already satisfied: packaging>=20.0 in /scratch_user/ypushak/automl-3/automl/py_3.8.7/lib/python3.8/site-packages (from matplotlib!=3.6.1,>=3.1->seaborn==0.12.1) (21.3)
Requirement already satisfied: fonttools>=4.22.0 in /scratch_user/ypushak/automl-3/automl/py_3.8.7/lib/python3.8/site-packages (from matplotlib!=3.6.1,>=3.1->seaborn==0.12.1) (4.29.1)
Requirement already satisfied: python-dateutil>=2.7 in /scratch_user/ypushak/automl-3/automl/py_3.8.7/lib/python3.8/site-packages (from matplotlib!=3.6.1,>=3.1->seaborn==0.12.1) (2.8.2)
Requirement already satisfied: pillow>=6.2.0 in /scratch_user/ypushak/automl-3/automl/py_3.8.7/lib/python3.8/site-packages (from matplotlib!=3.6.1,>=3.1->seaborn==0.12.1) (9.0.1)
Requirement already satisfied: pytz>=2020.1 in /scratch_user/ypushak/automl-3/automl/py_3.8.7/lib/python3.8/site-packages (from pandas>=0.25->seaborn==0.12.1) (2022.1)
Requirement already satisfied: six in /scratch_user/ypushak/automl-3/automl/py_3.8.7/lib/python3.8/site-packages (from cycler>=0.10->matplotlib!=3.6.1,>=3.1->seaborn==0.12.1) (1.16.0)
Requirement already satisfied: setuptools in /scratch_user/ypushak/automl-3/automl/py_3.8.7/lib/python3.8/site-packages (from kiwisolver>=1.0.1->matplotlib!=3.6.1,>=3.1->seaborn==0.12.1) (62.3.4)

[notice] A new release of pip available: 22.2.2 -> 23.0
[notice] To update, run: pip install --upgrade pip

Load the required modules.

In [2]:
import gzip
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import roc_auc_score, confusion_matrix, make_scorer, f1_score
from sklearn.linear_model import LogisticRegression
from sklearn.compose import make_column_selector as selector
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split

# Settings for plots
plt.rcParams['figure.figsize'] = [10, 7]
plt.rcParams['font.size'] = 15
sns.set(color_codes=True)
sns.set(font_scale=1.5)
sns.set_palette("bright")
sns.set_style("whitegrid")

import automl
from automl import init

Load the Census Income dataset¶

We start by reading in the dataset from OpenML.

In [3]:
dataset = fetch_openml(name='adult', as_frame=True)
df, y = dataset.data, dataset.target

Lets look at a few of the values in the data

In [4]:
df.head()
Out[4]:
age workclass fnlwgt education education-num marital-status occupation relationship race sex capitalgain capitalloss hoursperweek native-country
0 2 State-gov 77516.0 Bachelors 13.0 Never-married Adm-clerical Not-in-family White Male 1 0 2 United-States
1 3 Self-emp-not-inc 83311.0 Bachelors 13.0 Married-civ-spouse Exec-managerial Husband White Male 0 0 0 United-States
2 2 Private 215646.0 HS-grad 9.0 Divorced Handlers-cleaners Not-in-family White Male 0 0 2 United-States
3 3 Private 234721.0 11th 7.0 Married-civ-spouse Handlers-cleaners Husband Black Male 0 0 2 United-States
4 1 Private 338409.0 Bachelors 13.0 Married-civ-spouse Prof-specialty Wife Black Female 0 0 2 Cuba

The Adult dataset contains a mix of numerical and string data, making it a challenging problem to train machine learning models on.

In [5]:
pd.DataFrame({'Data type': df.dtypes}).T
Out[5]:
age workclass fnlwgt education education-num marital-status occupation relationship race sex capitalgain capitalloss hoursperweek native-country
Data type category category float64 category float64 category category category category category category category category category

The dataset also contains a lot of missing values. The Oracle AutoMLx solution automatically handles missing values by dropping features with too many missing values, and filling in the remaining missing values based on the feature type.

In [6]:
pd.DataFrame({'% missing values': df.isnull().sum() * 100 / len(df)}).T
Out[6]:
age workclass fnlwgt education education-num marital-status occupation relationship race sex capitalgain capitalloss hoursperweek native-country
% missing values 0.0 5.730724 0.0 0.0 0.0 0.0 5.751198 0.0 0.0 0.0 0.0 0.0 0.0 1.754637

We visualize the distribution of the target variable in the training data.

In [7]:
y_df = pd.DataFrame(y)
y_df.columns = ['income']

ax = sns.countplot(x="income", data=y_df)
ax.bar_label(ax.containers[0])
Out[7]:
[Text(0, 0, '11687'), Text(0, 0, '37155')]

We now separate the predictions (y) from the training data (X) for both the training (70%) and test (30%) datasets. The training set will be used to create a Machine Learning model using AutoML, and the test set will be used to evaluate the model's performance on unseen data.

In [8]:
# Several of the columns are incorrectly labeled as category type in the original dataset
numeric_columns = ['age', 'capitalgain', 'capitalloss', 'hoursperweek']
for col in df.columns:
    if col in numeric_columns:
        df[col] = df[col].astype(int)
    

X_train, X_test, y_train, y_test = train_test_split(df,
                                                    y.map({'>50K': 1, '<=50K': 0}).astype(int),
                                                    train_size=0.7,
                                                    random_state=0)

X_train.shape, X_test.shape
Out[8]:
((34189, 14), (14653, 14))

AutoML¶

Setting the engine¶

The AutoML pipeline offers the function init, which allows to initialize the parallel engine. By default, the AutoML pipeline uses the dask parallel engine. One can also set the engine to local, which uses python's multiprocessing library for parallelism instead.

In [9]:
init(engine='local')
[2023-02-10 12:25:20,318] [automl.xengine] Local ProcessPool execution (n_jobs=40)

Create an instance of Oracle AutoMLx¶

The Oracle AutoMLx solution provides a pipeline that automatically finds a tuned model given a prediction task and a training dataset. In particular it allows to find a tuned model for any supervised prediction task, e.g. classification or regression where the target can be binary, categorical or real-valued.

AutoML consists of five main modules:

  • Preprocessing : Clean, impute, engineer, and normalize features.
  • Algorithm Selection : Identify the right classification algorithm -in this notebook- for a given dataset, choosing from amongst:
    • AdaBoostClassifier
    • DecisionTreeClassifier
    • ExtraTreesClassifier
    • TorchMLPClassifier
    • KNeighborsClassifier
    • LGBMClassifier
    • LinearSVC
    • LogisticRegression
    • RandomForestClassifier
    • SVC
    • XGBClassifier
    • GaussianNB
  • Adaptive Sampling : Select a subset of the data samples for the model to be trained on.
  • Feature Selection : Select a subset of the data features, based on the previously selected model.
  • Hyperparameter Tuning : Find the right model parameters that maximize score for the given dataset.

All these pieces are readily combined into a simple AutoML pipeline which automates the entire Machine Learning process with minimal user input/interaction.

Train a model using AutoML¶

The AutoML API is quite simple to work with. We create an instance of the pipeline. Next, the training data is passed to the fit() function which executes the previously mentioned steps.

In [10]:
est1 = automl.Pipeline(task='classification')
est1.fit(X_train, y_train)
[2023-02-10 12:25:20,732] [automl.pipeline] Random state (7) is used for model builds
[2023-02-10 12:25:20,778] [automl.pipeline] Dataset shape: (34189, 14)
[2023-02-10 12:25:20,879] [automl.pipeline] Running Auto-Preprocessing
[2023-02-10 12:25:21,125] [automl.pipeline] Preprocessing completed. Updated Dataset shape: (34189, 15), cv: 5
[2023-02-10 12:25:21,528] [automl.pipeline] SVC, KNeighborsClassifier are disabled for datasets with > 10K samples or > 1K features
[2023-02-10 12:25:21,529] [automl.pipeline] Running Model Selection
[2023-02-10 12:25:34,335] [automl.pipeline] Model Selection completed. Selected model: ['LGBMClassifier']
[2023-02-10 12:25:34,338] [automl.pipeline] Running Adaptive Sampling. Dataset Shape: (34189, 15), Valid Shape: None, CV: 5, Class counts: [25955  8234]
[2023-02-10 12:25:37,301] [automl.pipeline] Adaptive Sampling Completed. Updated Dataset Shape: (34189, 15), Valid Shape: None, CV: [(12342, 6838), (12342, 6838), (12342, 6838), (12342, 6838), (12342, 6837)], Class counts: [25955  8234]
[2023-02-10 12:25:37,302] [automl.pipeline] Starting Feature Selection 0. Dataset Shape: (34189, 15)
[2023-02-10 12:25:41,431] [automl.pipeline] Feature Selection 0 completed. Updated Dataset shape: (34189, 11)
[2023-02-10 12:25:41,511] [automl.pipeline] Tuning LGBMClassifier
[2023-02-10 12:25:51,856] [automl.pipeline] Tuning completed. Best params: {'boosting_type': 'gbdt', 'class_weight': 'balanced', 'learning_rate': 0.1020594989702264, 'max_depth': 5, 'min_child_weight': 0.001, 'n_estimators': 101, 'num_leaves': 31, 'reg_alpha': 0, 'reg_lambda': 1}
[2023-02-10 12:25:51,954] [automl.pipeline] (Re)fitting Pipeline
[2023-02-10 12:25:52,783] [automl.xengine] Local ProcessPool execution (n_jobs=40)
[2023-02-10 12:25:52,877] [automl.pipeline] AutoML completed. Time taken - 31.554 sec
Out[10]:
Pipeline(model_list=['DecisionTreeClassifier', 'ExtraTreesClassifier',
                     'GaussianNB', 'LGBMClassifier', 'LogisticRegression',
                     'RandomForestClassifier', 'SVC', 'XGBClassifier',
                     'TorchMLPClassifier', 'CatBoostClassifier'])
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
Pipeline(model_list=['DecisionTreeClassifier', 'ExtraTreesClassifier',
                     'GaussianNB', 'LGBMClassifier', 'LogisticRegression',
                     'RandomForestClassifier', 'SVC', 'XGBClassifier',
                     'TorchMLPClassifier', 'CatBoostClassifier'])

A model is then generated (est1) and can be used for prediction tasks. We use the roc_auc_score scoring metric to evaluate the performance of this model on unseen data (X_test).

In [11]:
y_proba = est1.predict_proba(X_test)
score_default = roc_auc_score(y_test, y_proba[:, 1])

print(f'Score on test data : {score_default}')
Score on test data : 0.9129413734433824

Analyze the AutoML optimization process¶

During the AutoML process, a summary of the optimization process is logged. It consists of:

  • Information about the training data .
  • Information about the AutoML Pipeline, such as:
    • Selected features that AutoML found to be most predictive in the training data;
    • Selected algorithm that was the best choice for this data;
    • Selected hyperparameters for the selected algorithm.

AutoML provides a print_summary API to output all the different trials performed.

In [12]:
est1.print_summary()
Training Dataset size (34189, 14)
Validation Dataset size None
CV 5
Optimization Metric neg_log_loss
Selected Features Index(['age', 'workclass', 'education-num', 'marital-status', 'occupation', 'relationship', 'sex_1', 'sex_2', 'capitalgain', 'capitalloss', 'hoursperweek'], dtype='object')
Selected Algorithm LGBMClassifier
Time taken 31.1614
Selected Hyperparameters {'boosting_type': 'gbdt', 'class_weight': 'balanced', 'colsample_bytree': 1.0, 'importance_type': 'split', 'learning_rate': 0.1020594989702264, 'max_depth': 5, 'min_child_samples': 20, 'min_child_weight': 0.001, 'min_split_gain': 0.0, 'n_estimators': 101, 'n_jobs': 16, 'num_leaves': 31, 'objective': None, 'random_state': 7, 'reg_alpha': 0, 'reg_lambda': 1, 'silent': 'warn', 'subsample': 1.0, 'subsample_for_bin': 200000, 'subsample_freq': 0}
AutoML version 23.1.1
Python version 3.8.7 (default, Aug 25 2022, 13:59:56) \n[GCC 8.5.0 20210514 (Red Hat 8.5.0-10.1.0.1)]
Algorithm #Samples #Features Mean Validation Score Hyperparameters CPU Time Memory Usage (GB)
LGBMClassifier_HT 34189 11 -0.3771 {'boosting_type': 'gbdt', 'class_weight': 'balanced', 'learning_rate': 0.1020594989702264, 'max_depth': 5, 'min_child_weight': 0.001, 'n_estimators': 101, 'num_leaves': 31, 'reg_alpha': 0, 'reg_lambda': 1} 0.9159 (0.0, None)
LGBMClassifier_HT 34189 11 -0.3772 {'boosting_type': 'gbdt', 'class_weight': 'balanced', 'learning_rate': 0.1020594989702264, 'max_depth': 5, 'min_child_weight': 0.001, 'n_estimators': 100, 'num_leaves': 31, 'reg_alpha': 0, 'reg_lambda': 1} 0.9012 (0.0, None)
LGBMClassifier_HT 34189 11 -0.3772 {'boosting_type': 'gbdt', 'class_weight': 'balanced', 'learning_rate': 0.1, 'max_depth': 5, 'min_child_weight': 0.001, 'n_estimators': 100, 'num_leaves': 31, 'reg_alpha': 0.000737912177972937, 'reg_lambda': 1} 0.8261 (0.0, None)
LGBMClassifier_HT 34189 11 -0.3772 {'boosting_type': 'gbdt', 'class_weight': 'balanced', 'learning_rate': 0.1, 'max_depth': 5, 'min_child_weight': 0.001, 'n_estimators': 100, 'num_leaves': 31, 'reg_alpha': 0.0002435571444373244, 'reg_lambda': 1} 0.7303 (0.0, None)
LGBMClassifier_HT 34189 11 -0.3772 {'boosting_type': 'gbdt', 'class_weight': 'balanced', 'learning_rate': 0.1, 'max_depth': 5, 'min_child_weight': 0.001, 'n_estimators': 100, 'num_leaves': 31, 'reg_alpha': 0.00024396488949463078, 'reg_lambda': 1} 0.8530 (0.0, None)
... ... ... ... ... ... ...
LGBMClassifier_HT 34189 11 -0.6930 {'boosting_type': 'dart', 'class_weight': 'balanced', 'learning_rate': 0.0001, 'max_depth': 1, 'min_child_weight': 0.0, 'n_estimators': 5, 'num_leaves': 3, 'reg_alpha': 1e-10, 'reg_lambda': 1} 0.1675 (0.0, None)
LGBMClassifier_RandomForestClassifier_FS 34189 1 -0.6979 {'boosting_type': 'gbdt', 'learning_rate': 0.1, 'max_depth': -1, 'min_child_weight': 0.001, 'n_estimators': 100, 'num_leaves': 31, 'reg_alpha': 0, 'reg_lambda': 1, 'class_weight': 'balanced'} 0.8031 0.0
GaussianNB_AS 5000 15 -0.9008 {} 0.1060 0.0
TorchMLPClassifier_AS 5000 15 -1.1046 {'activation': 'relu', 'class_weight': None, 'dropout': 0.1, 'l2_reg': 0.0, 'nr_layers': 1, 'nr_neurons': 100, 'optimizer': 'adam'} 28.5616 0.0134
DecisionTreeClassifier_AS 5000 15 -1.9674 {'class_weight': None, 'max_features': 1.0, 'min_samples_leaf': 0.000625, 'min_samples_split': 0.00125} 0.2801 0.0

We also provide the capability to visualize the results of each stage of the AutoML pipeline.

Algorithm Selection¶

The plot below shows the scores predicted by Algorithm Selection for each algorithm. The horizontal line shows the average score across all algorithms. Algorithms below the line are colored turquoise, whereas those with a score higher than the mean are colored teal. Here we can see that the LGBMClassifier achieved the highest predicted score (orange bar), and is chosen for subsequent stages of the Pipeline.

In [13]:
# Each trial is a tuple of
# (algorithm, no. samples, no. features, mean CV score, hyperparameters, 
# all CV scores, total CV time (s), memory usage (Gb))
trials = est1.model_selection_trials_
colors = []
scores = [x[3] for x in trials]
models = [x[0] for x in trials]
y_margin = 0.10 * (max(scores) - min(scores))
s = pd.Series(scores, index=models).sort_values(ascending=False)

for f in s.keys():
    if f == '{}_AS'.format(est1.selected_model_):
        colors.append('orange')
    elif s[f] >= s.mean():
        colors.append('teal')
    else:
        colors.append('turquoise')
        

fig, ax = plt.subplots(1)
ax.set_title("Algorithm Selection Trials")
ax.set_ylim(min(scores) - y_margin, max(scores) + y_margin)
ax.set_ylabel(est1.inferred_score_metric[0])
s.plot.bar(ax=ax, color=colors, edgecolor='black')
ax.axhline(y=s.mean(), color='black', linewidth=0.5)
plt.show()

Adaptive Sampling¶

Following Algorithm Selection, Adaptive Sampling aims to find the smallest dataset sample that can be created without compromising validation set score for the chosen model (LGBMClassifier). In this case, no subset is found to keep sufficient validation score, so we continue with the entire dataset.

In [14]:
# Each trial is a tuple of
# (algorithm, no. samples, no. features, mean CV score, hyperparameters, 
# all CV scores, total CV time (s), memory usage (Gb))
trials = est1.adaptive_sampling_trials_
scores = [x[3] for x in trials]
n_samples = [x[1] for x in trials]
y_margin = 0.10 * (max(scores) - min(scores))

fig, ax = plt.subplots(1)
ax.set_title("Adaptive Sampling ({})".format(trials[0][0]))
ax.set_xlabel('Dataset sample size')
ax.set_ylabel(est1.inferred_score_metric[0])
ax.grid(color='g', linestyle='-', linewidth=0.1)
ax.set_ylim(min(scores) - y_margin, max(scores) + y_margin)
ax.plot(n_samples, scores, 'k:', marker="s", color='teal', markersize=3)
plt.show()

Feature Selection¶

After finding a sample subset, the next step is to find a relevant feature subset to maximize score for the chosen algorithm. AutoMLx Feature Selection follows an intelligent search strategy, looking at various possible feature rankings and subsets, and identifying that smallest feature subset that does not compromise on score for the chosen algorithm (LGBMClassifier). The orange line shows the optimal number of features chosen by Feature Selection (in this case 11 features, droppping fnlwgt, education, race and native-country).

In [15]:
print(f"Features selected: {est1.selected_features_names_.to_list()}")
dropped_features = est1.pipeline.preprocessor.transformed_col_names.drop(est1.selected_features_names_)
print(f"Features dropped: {dropped_features.to_list()}")

# Each trial is a tuple of
# (algorithm, no. samples, no. features, mean CV score, hyperparameters, 
# all CV scores, total CV time (s), memory usage (Gb))
trials = est1.feature_selection_trials_
scores = [x[3] for x in trials]
n_features = [x[2] for x in trials]
y_margin = 0.10 * (max(scores) - min(scores))


fig, ax = plt.subplots(1)
ax.set_title("Feature Selection Trials")
ax.set_xlabel("Number of Features")
ax.set_ylabel(est1.inferred_score_metric[0])
ax.grid(color='g', linestyle='-', linewidth=0.1)
ax.set_ylim(min(scores) - y_margin, max(scores) + y_margin)
ax.plot(n_features, scores, 'k:', marker="s", color='teal', markersize=3)
ax.axvline(x=len(est1.selected_features_names_), color='orange', linewidth=2.0)
plt.show()
Features selected: ['age', 'workclass', 'education-num', 'marital-status', 'occupation', 'relationship', 'sex_1', 'sex_2', 'capitalgain', 'capitalloss', 'hoursperweek']
Features dropped: ['fnlwgt', 'education', 'race', 'native-country']

Hyperparameter Tuning¶

Hyperparameter Tuning is the last stage of the AutoML pipeline, and focuses on improving the chosen algorithm's score on the reduced dataset (after Adaptive Sampling and Feature Selection). We use a novel algorithm to search across many hyperparameters dimensions, and converge automatically when optimal hyperparameters are identified. Each trial in the graph below represents a particular hyperparameters configuration for the selected model.

In [16]:
# Each trial is a tuple of
# (algorithm, no. samples, no. features, mean CV score, hyperparameters, 
# all CV scores, total CV time (s), memory usage (Gb))
trials = est1.tuning_trials_
scores = [x[3] for x in reversed(trials)]
y_margin = 0.10 * (max(scores) - min(scores))


fig, ax = plt.subplots(1)
ax.set_title("Hyperparameter Tuning Trials")
ax.set_xlabel("Iteration $n$")
ax.set_ylabel(est1.inferred_score_metric[0])
ax.grid(color='g', linestyle='-', linewidth=0.1)
ax.set_ylim(min(scores) - y_margin, max(scores) + y_margin)
ax.plot(range(1, len(trials) + 1), scores, 'k:', marker="s", color='teal', markersize=3)
plt.show()

Confusion Matrix¶

We can use a Confusion Matrix to help us visualize the model's behavior. Note that the displayed confusion matrix represents percentages.

In [17]:
y_pred = est1.predict(X_test)
cm = confusion_matrix(y_test.astype(int), y_pred, labels=[False, True])
cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]

df_cm = pd.DataFrame(cm)
df_cm = df_cm * 100

ax = sns.heatmap(df_cm, annot=True, fmt='.2f', cbar=False, 
                 xticklabels=['<=50K', '>50K'], yticklabels=['<=50K', '>50K'])
ax.set_xlabel('Prediction')
ax.set_ylabel('Actual')
Out[17]:
Text(58.5, 0.5, 'Actual')

Provide a specific model list to AutoML¶

The Oracle AutoMLx solution also supports a model_list argument, allowing you to control what algorithms AutoML considers during its optimization process. model_list is specified as a list of strings, which can be any combination of

  • AdaBoostClassifier
  • DecisionTreeClassifier
  • ExtraTreesClassifier
  • TorchMLPClassifier
  • KNeighborsClassifier
  • LGBMClassifier
  • LinearSVC
  • LogisticRegression
  • RandomForestClassifier
  • SVC
  • XGBClassifier
  • GaussianNB

Here, we ask AutoML to only consider LogisticRegression. Note that it does not perform well compared to the full capabilites of the AutoML pipeline.

In [18]:
est2 = automl.Pipeline(model_list=['LogisticRegression'])
est2.fit(X_train, y_train)
y_proba = est2.predict_proba(X_test)
score_modellist = roc_auc_score(y_test, y_proba[:, 1])

print(f'ROC AUC Score on test data : {score_modellist}')
[2023-02-10 12:25:55,310] [automl.pipeline] Random state (7) is used for model builds
[2023-02-10 12:25:55,353] [automl.pipeline] Dataset shape: (34189, 14)
[2023-02-10 12:25:55,449] [automl.pipeline] Running Auto-Preprocessing
[2023-02-10 12:25:55,723] [automl.pipeline] Preprocessing completed. Updated Dataset shape: (34189, 15), cv: 5
[2023-02-10 12:25:55,897] [automl.pipeline] SVC, KNeighborsClassifier are disabled for datasets with > 10K samples or > 1K features
[2023-02-10 12:25:55,899] [automl.pipeline] Running Model Selection
[2023-02-10 12:25:55,900] [automl.pipeline] Algorithm Selection Disabled
[2023-02-10 12:25:55,901] [automl.pipeline] User provided model LogisticRegression is the only desired model
[2023-02-10 12:25:55,933] [automl.pipeline] Model Selection completed. Selected model: ['LogisticRegression']
[2023-02-10 12:25:55,934] [automl.pipeline] Running Adaptive Sampling. Dataset Shape: (34189, 15), Valid Shape: None, CV: 5, Class counts: [25955  8234]
[2023-02-10 12:26:01,625] [automl.pipeline] Adaptive Sampling Completed. Updated Dataset Shape: (34189, 15), Valid Shape: None, CV: [(6730, 6838), (6730, 6838), (6730, 6838), (6730, 6838), (6730, 6837)], Class counts: [25955  8234]
[2023-02-10 12:26:01,625] [automl.pipeline] Starting Feature Selection 0. Dataset Shape: (34189, 15)
[2023-02-10 12:26:04,774] [automl.pipeline] Feature Selection 0 completed. Updated Dataset shape: (34189, 13)
[2023-02-10 12:26:04,857] [automl.pipeline] Tuning LogisticRegression
[2023-02-10 12:26:06,626] [automl.pipeline] Tuning completed. Best params: {'C': 0.0363696875, 'class_weight': 'balanced', 'solver': 'liblinear'}
[2023-02-10 12:26:06,716] [automl.pipeline] (Re)fitting Pipeline
[2023-02-10 12:26:07,407] [automl.xengine] Local ProcessPool execution (n_jobs=40)
[2023-02-10 12:26:07,492] [automl.pipeline] AutoML completed. Time taken - 11.568 sec
ROC AUC Score on test data : 0.8884370604236482

Use a custom validation set¶

You can specify a custom validation set that you want AutoML to use to evaluate the quality of models and configurations.

In [19]:
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, train_size=0.7, random_state=0)
In [20]:
est3 = automl.Pipeline()
est3.fit(X_train, y_train, X_val, y_val)

y_proba = est3.predict_proba(X_test)
score_valset = roc_auc_score(y_test, y_proba[:, 1])

print(f'ROC AUC Score on test data : {score_valset}')
[2023-02-10 12:26:10,820] [automl.pipeline] Random state (7) is used for model builds
[2023-02-10 12:26:10,870] [automl.pipeline] Dataset shape: (23932, 14)
[2023-02-10 12:26:10,955] [automl.pipeline] Running Auto-Preprocessing
[2023-02-10 12:26:11,185] [automl.pipeline] Preprocessing completed. Updated Dataset shape: (23932, 15), cv: None
[2023-02-10 12:26:11,342] [automl.pipeline] SVC, KNeighborsClassifier are disabled for datasets with > 10K samples or > 1K features
[2023-02-10 12:26:11,343] [automl.pipeline] Running Model Selection
[2023-02-10 12:26:21,622] [automl.pipeline] Model Selection completed. Selected model: ['LGBMClassifier']
[2023-02-10 12:26:21,624] [automl.pipeline] Running Adaptive Sampling. Dataset Shape: (23932, 15), Valid Shape: (10257, 15), CV: None, Class counts: [18129  5803]
[2023-02-10 12:26:24,054] [automl.pipeline] Adaptive Sampling Completed. Updated Dataset Shape: (11378, 15), Valid Shape: (4554, 15), CV: None, Class counts: [5689 5689]
[2023-02-10 12:26:24,055] [automl.pipeline] Starting Feature Selection 0. Dataset Shape: (11378, 15)
[2023-02-10 12:26:25,706] [automl.pipeline] Feature Selection 0 completed. Updated Dataset shape: (11378, 13)
[2023-02-10 12:26:25,776] [automl.pipeline] Tuning LGBMClassifier
[2023-02-10 12:26:30,153] [automl.pipeline] Tuning completed. Best params: {'boosting_type': 'gbdt', 'class_weight': 'balanced', 'learning_rate': 0.1, 'max_depth': -1, 'min_child_weight': 0.001, 'n_estimators': 100, 'num_leaves': 31, 'reg_alpha': 0.00018782927469744562, 'reg_lambda': 1}
[2023-02-10 12:26:30,295] [automl.pipeline] (Re)fitting Pipeline
[2023-02-10 12:26:31,035] [automl.xengine] Local ProcessPool execution (n_jobs=40)
[2023-02-10 12:26:31,121] [automl.pipeline] AutoML completed. Time taken - 19.712 sec
ROC AUC Score on test data : 0.9131845625956725

Increase the number of tuned models¶

By default, we only optimize the hyper-parameters of the single best model from the model selection stage of our pipeline. However, you can also optimize the hyper-parameters of the top n models. In this example, we will tell the pipeline to tune the hyper-parameters for the best two of the three specified models.

In [21]:
est4 = automl.Pipeline(n_algos_tuned=2, model_list=['GaussianNB', 'LogisticRegression', 'DecisionTreeClassifier'])
est4.fit(X_train, y_train)

y_proba = est4.predict_proba(X_test)
score_modellist = roc_auc_score(y_test, y_proba[:, 1])

print(f'ROC AUC Score on test data : {score_modellist}')
[2023-02-10 12:26:31,630] [automl.pipeline] Random state (7) is used for model builds
[2023-02-10 12:26:31,668] [automl.pipeline] Dataset shape: (23932, 14)
[2023-02-10 12:26:31,751] [automl.pipeline] Running Auto-Preprocessing
[2023-02-10 12:26:31,946] [automl.pipeline] Preprocessing completed. Updated Dataset shape: (23932, 15), cv: 5
[2023-02-10 12:26:32,116] [automl.pipeline] SVC, KNeighborsClassifier are disabled for datasets with > 10K samples or > 1K features
[2023-02-10 12:26:32,117] [automl.pipeline] Running Model Selection
[2023-02-10 12:26:39,740] [automl.pipeline] Model Selection completed. Selected model: ['LogisticRegression', 'GaussianNB']
[2023-02-10 12:26:39,742] [automl.pipeline] Running Adaptive Sampling. Dataset Shape: (23932, 15), Valid Shape: None, CV: 5, Class counts: [18129  5803]
[2023-02-10 12:26:40,898] [automl.pipeline] Adaptive Sampling Completed. Updated Dataset Shape: (23932, 15), Valid Shape: None, CV: [(5830, 4787), (5830, 4787), (5830, 4786), (5830, 4786), (5830, 4786)], Class counts: [18129  5803]
[2023-02-10 12:26:40,899] [automl.pipeline] Starting Feature Selection 0. Dataset Shape: (23932, 15)
[2023-02-10 12:26:43,763] [automl.pipeline] Feature Selection 0 completed. Updated Dataset shape: (23932, 15)
[2023-02-10 12:26:43,837] [automl.pipeline] Tuning LogisticRegression
[2023-02-10 12:26:45,492] [automl.pipeline] Tuning completed. Best params: {'C': 0.041489374999999995, 'class_weight': None, 'solver': 'lbfgs'}
[2023-02-10 12:26:45,615] [automl.pipeline] Starting Feature Selection 1. Dataset Shape: (23932, 15)
[2023-02-10 12:26:48,319] [automl.pipeline] Feature Selection 1 completed. Updated Dataset shape: (23932, 5)
[2023-02-10 12:26:48,393] [automl.pipeline] Tuning GaussianNB
[2023-02-10 12:26:48,503] [automl.pipeline] Tuning completed. Best params: {}
[2023-02-10 12:26:48,654] [automl.pipeline] (Re)fitting Pipeline
[2023-02-10 12:26:50,915] [automl.xengine] Local ProcessPool execution (n_jobs=40)
[2023-02-10 12:26:51,025] [automl.pipeline] AutoML completed. Time taken - 18.808 sec
ROC AUC Score on test data : 0.8558559973728848

Specify a different scoring metric to AutoML¶

By default, the score metric is set to neg_log_loss for classifcation and neg_mean_squared_error for regression.

The user can also choose another scoring metric. The list of possible metrics is given by:

  • For binary classification, one of 'roc_auc', 'accuracy', 'f1', 'precision', 'recall', 'f1_micro', 'f1_macro', 'f1_weighted', 'f1_samples', 'recall_micro', 'recall_macro', 'recall_weighted', 'recall_samples', 'precision_micro', 'precision_macro', 'precision_weighted', 'precision_samples'
  • For multiclass classification , one of 'neg_log_loss', 'recall_macro', 'accuracy','f1_micro', 'f1_macro', 'f1_weighted', 'f1_samples', 'recall_micro', 'recall_weighted', 'recall_samples', 'precision_micro', 'precision_macro', 'precision_weighted', 'precision_samples'
  • For regression, one of 'neg_mean_squared_error', 'r2', 'neg_mean_absolute_error', 'neg_mean_squared_log_error', 'neg_median_absolute_error'

Here, we ask AutoML to optimize for the accuracy scoring metric.

In [22]:
est5 = automl.Pipeline(score_metric='accuracy', model_list=['LogisticRegression'])
est5.fit(X_train, y_train)

y_proba = est5.predict_proba(X_test)
score_scoringstr = roc_auc_score(y_test, y_proba[:, 1])

print(f'AUC ROC Score on test data : {score_scoringstr}')
[2023-02-10 12:26:51,500] [automl.pipeline] Random state (7) is used for model builds
[2023-02-10 12:26:51,541] [automl.pipeline] Dataset shape: (23932, 14)
[2023-02-10 12:26:51,637] [automl.pipeline] Running Auto-Preprocessing
[2023-02-10 12:26:51,840] [automl.pipeline] Preprocessing completed. Updated Dataset shape: (23932, 15), cv: 5
[2023-02-10 12:26:52,035] [automl.pipeline] SVC, KNeighborsClassifier are disabled for datasets with > 10K samples or > 1K features
[2023-02-10 12:26:52,035] [automl.pipeline] Running Model Selection
[2023-02-10 12:26:52,037] [automl.pipeline] Algorithm Selection Disabled
[2023-02-10 12:26:52,037] [automl.pipeline] User provided model LogisticRegression is the only desired model
[2023-02-10 12:26:52,081] [automl.pipeline] Model Selection completed. Selected model: ['LogisticRegression']
[2023-02-10 12:26:52,083] [automl.pipeline] Running Adaptive Sampling. Dataset Shape: (23932, 15), Valid Shape: None, CV: 5, Class counts: [18129  5803]
[2023-02-10 12:26:59,409] [automl.pipeline] Adaptive Sampling Completed. Updated Dataset Shape: (23932, 15), Valid Shape: None, CV: [(5830, 4787), (5830, 4787), (5830, 4786), (5830, 4786), (5830, 4786)], Class counts: [18129  5803]
[2023-02-10 12:26:59,409] [automl.pipeline] Starting Feature Selection 0. Dataset Shape: (23932, 15)
[2023-02-10 12:27:02,530] [automl.pipeline] Feature Selection 0 completed. Updated Dataset shape: (23932, 13)
[2023-02-10 12:27:02,608] [automl.pipeline] Tuning LogisticRegression
[2023-02-10 12:27:04,260] [automl.pipeline] Tuning completed. Best params: {'C': 0.0363696875, 'class_weight': None, 'solver': 'lbfgs'}
[2023-02-10 12:27:04,364] [automl.pipeline] (Re)fitting Pipeline
[2023-02-10 12:27:05,630] [automl.xengine] Local ProcessPool execution (n_jobs=40)
[2023-02-10 12:27:05,720] [automl.pipeline] AutoML completed. Time taken - 13.672 sec
AUC ROC Score on test data : 0.8555359470026064

Specify a user defined scoring function to AutoML¶

Alternatively, the score_metric can also be specified as a user defined function of the form

def score_fn(model, X, y):
    # logic here
    return score

Alternatively, the scoring function can be defined as

def score_fn(y_true, y_pred):
    # logic here
    return score

and then converted into the above format using scikit-learn's make_scorer function.

Here, we leverage scikit-learn's implementation of the f1_score function. We then create a scorer function (score_fn) and pass it to the score_metric argument of train.

In [23]:
# Define the scoring function
score_fn = make_scorer(f1_score, greater_is_better=True, needs_proba=False, average='macro')

est6 = automl.Pipeline(score_metric=score_fn, model_list=['LogisticRegression'])
est6.fit(X_train, y_train)

y_proba = est6.predict_proba(X_test)
score_scoringsfn = roc_auc_score(y_test, y_proba[:, 1])

print(f'AUC ROC Score on test data : {score_scoringsfn}')
[2023-02-10 12:27:06,202] [automl.pipeline] Random state (7) is used for model builds
[2023-02-10 12:27:06,244] [automl.pipeline] Dataset shape: (23932, 14)
[2023-02-10 12:27:06,350] [automl.pipeline] Running Auto-Preprocessing
[2023-02-10 12:27:06,562] [automl.pipeline] Preprocessing completed. Updated Dataset shape: (23932, 15), cv: 5
[2023-02-10 12:27:06,765] [automl.pipeline] SVC, KNeighborsClassifier are disabled for datasets with > 10K samples or > 1K features
[2023-02-10 12:27:06,766] [automl.pipeline] Running Model Selection
[2023-02-10 12:27:06,767] [automl.pipeline] Algorithm Selection Disabled
[2023-02-10 12:27:06,768] [automl.pipeline] User provided model LogisticRegression is the only desired model
[2023-02-10 12:27:06,799] [automl.pipeline] Model Selection completed. Selected model: ['LogisticRegression']
[2023-02-10 12:27:06,800] [automl.pipeline] Running Adaptive Sampling. Dataset Shape: (23932, 15), Valid Shape: None, CV: 5, Class counts: [18129  5803]
[2023-02-10 12:27:16,396] [automl.pipeline] Adaptive Sampling Completed. Updated Dataset Shape: (23932, 15), Valid Shape: None, CV: [(5830, 4787), (5830, 4787), (5830, 4786), (5830, 4786), (5830, 4786)], Class counts: [18129  5803]
[2023-02-10 12:27:16,397] [automl.pipeline] Starting Feature Selection 0. Dataset Shape: (23932, 15)
[2023-02-10 12:27:19,403] [automl.pipeline] Feature Selection 0 completed. Updated Dataset shape: (23932, 13)
[2023-02-10 12:27:19,488] [automl.pipeline] Tuning LogisticRegression
[2023-02-10 12:27:21,203] [automl.pipeline] Tuning completed. Best params: {'C': 0.0363696875, 'class_weight': None, 'solver': 'lbfgs'}
[2023-02-10 12:27:21,338] [automl.pipeline] (Re)fitting Pipeline
[2023-02-10 12:27:22,704] [automl.xengine] Local ProcessPool execution (n_jobs=40)
[2023-02-10 12:27:22,803] [automl.pipeline] AutoML completed. Time taken - 16.054 sec
AUC ROC Score on test data : 0.8555359470026064

Specify a time budget to AutoML¶

The Oracle AutoMLx tool also supports a user given time budget in seconds. Given the small size of this dataset, we give a small time budget of 10 seconds using the time_budget argument.

In [24]:
est7 = automl.Pipeline()
est7.fit(X_train, y_train, time_budget=10)

y_proba = est5.predict_proba(X_test)
score_timebudget = roc_auc_score(y_test, y_proba[:, 1])

print(f'AUC ROC Score on test data : {score_timebudget}')
[2023-02-10 12:27:24,486] [automl.pipeline] Random state (7) is used for model builds
[2023-02-10 12:27:24,529] [automl.pipeline] Dataset shape: (23932, 14)
[2023-02-10 12:27:24,625] [automl.pipeline] Running Auto-Preprocessing
[2023-02-10 12:27:24,818] [automl.pipeline] Preprocessing completed. Updated Dataset shape: (23932, 15), cv: 5
[2023-02-10 12:27:25,051] [automl.pipeline] SVC, KNeighborsClassifier are disabled for datasets with > 10K samples or > 1K features
[2023-02-10 12:27:25,053] [automl.pipeline] Running Model Selection
Time budget exceeded by 1.89s, resetting XEngine
Timebudget Exceeded or Timedout completed 36/45, 0 tasks timedout
[2023-02-10 12:27:36,457] [automl.pipeline] Model Selection completed. Selected model: ['XGBClassifier']
[2023-02-10 12:27:36,458] [automl.pipeline] Running Adaptive Sampling. Dataset Shape: (23932, 15), Valid Shape: None, CV: 5, Class counts: [18129  5803]
[2023-02-10 12:27:36,547] [automl.pipeline] Timebudget exhausted. Skipping Adaptive Sampling
[2023-02-10 12:27:36,577] [automl.pipeline] Adaptive Sampling Completed. Updated Dataset Shape: (23932, 15), Valid Shape: None, CV: 5, Class counts: [18129  5803]
[2023-02-10 12:27:36,577] [automl.pipeline] Starting Feature Selection 0. Dataset Shape: (23932, 15)
[2023-02-10 12:27:36,661] [automl.pipeline] Timebudget exhausted. Skipping Feature Selection
[2023-02-10 12:27:36,662] [automl.pipeline] Using all features: Index(['age', 'workclass', 'fnlwgt', 'education', 'education-num',
       'marital-status', 'occupation', 'relationship', 'race', 'sex_1',
       'sex_2', 'capitalgain', 'capitalloss', 'hoursperweek',
       'native-country'],
      dtype='object')
[2023-02-10 12:27:36,691] [automl.pipeline] Feature Selection 0 completed. Updated Dataset shape: (23932, 15)
[2023-02-10 12:27:36,772] [automl.pipeline] Timebudget exhausted. Skipping Hyperparameter Optimization for XGBClassifier
[2023-02-10 12:27:36,806] [automl.pipeline] (Re)fitting Pipeline
[2023-02-10 12:27:37,721] [automl.xengine] Local ProcessPool execution (n_jobs=40)
[2023-02-10 12:27:37,826] [automl.pipeline] AutoML completed. Time taken - 12.780 sec
AUC ROC Score on test data : 0.8555359470026064

Specify a minimum feature list to AutoML¶

The Oracle AutoMLx Pipeline also supports a min_features argument. AutoML will ensure that these features are part of the final model that it creates, and these are not dropped during the feature selection phase. It can take 3 possible types of values:

  • If int, 0 < min_features <= n_features,
  • If float, 0 < min_features <= 1.0, 1.0 means disabling feature selection
  • If list, names of features to keep, for example ['a', 'b'] means keep features 'a' and 'b'
In [25]:
est8 = automl.Pipeline(min_features=['fnlwgt', 'native-country'],  model_list=['LogisticRegression'])
est8.fit(X_train, y_train)
y_proba = est6.predict_proba(X_test)
score_minfeatures = roc_auc_score(y_test, y_proba[:, 1])

print(f'Score on test data : {score_minfeatures}')
print(f'Selected features: {est6.selected_features_names_}')
[2023-02-10 12:27:38,400] [automl.pipeline] Random state (7) is used for model builds
[2023-02-10 12:27:38,439] [automl.pipeline] Dataset shape: (23932, 14)
[2023-02-10 12:27:38,528] [automl.pipeline] Running Auto-Preprocessing
[2023-02-10 12:27:38,718] [automl.pipeline] Preprocessing completed. Updated Dataset shape: (23932, 15), cv: 5
[2023-02-10 12:27:38,939] [automl.pipeline] SVC, KNeighborsClassifier are disabled for datasets with > 10K samples or > 1K features
[2023-02-10 12:27:38,940] [automl.pipeline] Running Model Selection
[2023-02-10 12:27:38,941] [automl.pipeline] Algorithm Selection Disabled
[2023-02-10 12:27:38,942] [automl.pipeline] User provided model LogisticRegression is the only desired model
[2023-02-10 12:27:38,971] [automl.pipeline] Model Selection completed. Selected model: ['LogisticRegression']
[2023-02-10 12:27:38,972] [automl.pipeline] Running Adaptive Sampling. Dataset Shape: (23932, 15), Valid Shape: None, CV: 5, Class counts: [18129  5803]
[2023-02-10 12:27:46,098] [automl.pipeline] Adaptive Sampling Completed. Updated Dataset Shape: (23932, 15), Valid Shape: None, CV: [(5830, 4787), (5830, 4787), (5830, 4786), (5830, 4786), (5830, 4786)], Class counts: [18129  5803]
[2023-02-10 12:27:46,099] [automl.pipeline] Starting Feature Selection 0. Dataset Shape: (23932, 15)
[2023-02-10 12:27:48,859] [automl.pipeline] Feature Selection 0 completed. Updated Dataset shape: (23932, 15)
[2023-02-10 12:27:48,954] [automl.pipeline] Tuning LogisticRegression
[2023-02-10 12:27:50,703] [automl.pipeline] Tuning completed. Best params: {'C': 0.041489374999999995, 'class_weight': None, 'solver': 'lbfgs'}
[2023-02-10 12:27:50,842] [automl.pipeline] (Re)fitting Pipeline
[2023-02-10 12:27:52,157] [automl.xengine] Local ProcessPool execution (n_jobs=40)
[2023-02-10 12:27:52,254] [automl.pipeline] AutoML completed. Time taken - 13.302 sec
Score on test data : 0.8555359470026064
Selected features: Index(['age', 'workclass', 'education', 'education-num', 'marital-status',
       'occupation', 'relationship', 'race', 'sex_1', 'sex_2', 'capitalgain',
       'capitalloss', 'hoursperweek'],
      dtype='object')

Machine Learning Explainability¶

For a variety of decision-making tasks, getting only a prediction as model output is not sufficient. A user may wish to know why the model outputs that prediction, or which data features are relevant for that prediction. For that purpose the Oracle AutoMLx solution defines the MLExplainer object, which allows to compute a variety of model explanations

Initializing an MLExplainer¶

The MLExplainer object takes as argument the trained model, the training data and labels, as well as the task.

In [26]:
explainer = automl.MLExplainer(est1,
                               X_train,
                               y_train,
                               target_names=["<=50K", ">50K"],
                               task="classification")

Model Explanations (Global Feature importance)¶

The notion of Global Feature Importance intuitively measures how much the model's performance (relative to the provided train labels) would change if a given feature were dropped from the dataset, without retraining the model. This notion of feature importance considers each feature independently from all other features.

Computing the importance¶

By default we use a permutation method to successively measure the importance of each feature. Such a method therefore runs in linear time with respect to the number of features in the dataset.

The method explain_model() allows to compute such feature importances. It also provides 95% confidence intervals for each feature importance attribution.

In [27]:
result_explain_model_default = explainer.explain_model()

Visualization¶

There are two options to show the explanation's results:

  • to_dataframe() will return a dataframe of the results.
  • show_in_notebook() will show the results as a bar plot.

The features are returned in decreasing order of importance. We see that marital-status and education-num are considered to be the most important features.

In [28]:
result_explain_model_default.to_dataframe()
Out[28]:
feature attribution upper_bound lower_bound
0 marital-status 0.064708 0.068405 0.061011
1 education-num 0.032377 0.037343 0.027410
2 capitalgain 0.028913 0.030922 0.026903
3 hoursperweek 0.009765 0.014313 0.005217
4 occupation 0.008872 0.015387 0.002357
5 age 0.007591 0.010669 0.004512
6 workclass 0.003108 0.003659 0.002558
7 capitalloss 0.002685 0.003707 0.001662
8 relationship -0.003469 -0.001035 -0.005903
9 sex -0.003851 -0.000877 -0.006825
In [29]:
result_explain_model_default.show_in_notebook()

Feature Dependence Explanations (PDP + ICE)¶

Another way to measure dependency on a feature is through a partial dependence plot (PDP) or an individual conditional expectation (ICE) plot. For accumulated local effects (ALE) explanations, see Advanced Feature Dependence Options (ALE)

Given a dataset, a PDP displays the average output of the model as a function of the value of the selected set of features.

It can be computed for a single feature, as in the cell below. The X-axis is the value of the education-num feature and the y-axis is the corresponding outputted price. Since we are considering the whole dataset, the average over outputs is given by the red line, while the shaded interval corresponds to a 95% confidence interval for the average.

The histogram on top shows the distribution of the value of the education-num feature in the dataset.

In [30]:
result_explain_feature_dependence_default = explainer.explain_feature_dependence('education-num')
result_explain_feature_dependence_default.show_in_notebook()

The ICE plot is automatically computed at the same time as any one-feature PDP. It can be accessed by passing ice=True to show_in_notebook.

Similar to PDPs, ICE plots show the median prediction as a model red line. However, the variance in the model's predictions are shown by plotting the predictions of a sample of individual data instances as light grey lines. (For categorical features, the distribution in the predictions is instead shown as a violin plot.)

In [31]:
result_explain_feature_dependence_default.show_in_notebook(ice=True)

We can also plot the PDP for up to four features. The plot below is the PDP for the education-num and sex features. The X-axis still shows the values of education-num, while there is a different bar and confidence interval for each value of the feature sex.

The histogram displays the joint distribution of the two features.

In [32]:
result_explain_feature_dependence_default = explainer.explain_feature_dependence(['education-num', 'sex'])
result_explain_feature_dependence_default.show_in_notebook()

PDPs can be comptued for an arbitrary number of variables; however, they can only be visualized with up to 4. We show an example with 3 below.

In [33]:
result_explain_feature_dependence_default = explainer.explain_feature_dependence(['education-num', 'hoursperweek', 'sex'])
result_explain_feature_dependence_default.show_in_notebook()

Prediction Explanations (Local Feature Importance)¶

Given a data sample, one can also obtain the local importance, which is the importance of the features for the model's prediction on that sample. In the following cell, we consider sample $0$. The function explain_prediction() computes the local importance for a given sample.

education-num=12.0 means that the value of feature education-num for that sample is 12.0. Removing that feature would change the model's prediction by the magnitude of the bar. That is, in this case, the model's prediction for the probability that the person makes less than 50K is approximately 0.05-0.10 larger because the model knows the value of education-num is 12.0.

In [34]:
index = 0
result_explain_prediction_default = explainer.explain_prediction(X_train.iloc[index:index + 1,:])
result_explain_prediction_default[0].show_in_notebook()

Aggregate Local Feature Importance¶

We now summarize all of the individual local feature importance explanations into one single aggregate explanation.

In [35]:
# We select 10 random instances here as an example and show the aggregate explanation of those instances.
local_explanations = explainer.explain_prediction(X_train.sample(n=10))
alfi = explainer.aggregate(explanations=local_explanations)
alfi.show_in_notebook()
Out[35]:

Interactive What-If Explainers¶

The Oracle AutoMLx solution offers also What-IF tool to explain a trained ML model's predictions through a simple interactive interface.

You can use What-IF explainer to explore and visualize immediately how changing a sample value will affect the model's prediction. Forthermore, What-IF can be used to visualize how model's predictions are related to any feature of the dataset.

In [36]:
explainer.explore_whatif(X_test, y_test)
Out[36]:

Advanced Feature Importance Options¶

We now show how to use an alternative method for computing feature dependence. Here, we will explain a custom scikit-learn model. Note that the MLExplainer object is capable to explain any classification model, as long as the model follows a scikit-learn-style interface with the predict and predict_proba functions.

We then create the explainer object.

In [37]:
numeric_transformer = Pipeline(
    steps=[("imputer", SimpleImputer(strategy="median")), ("scaler", StandardScaler())]
)

categorical_transformer = OneHotEncoder(handle_unknown="ignore")

preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, selector(dtype_exclude=[object, 'category'])),
        ("cat", categorical_transformer, selector(dtype_include=[object, 'category'])),
    ]
)
scikit_model = Pipeline(
    steps=[("preprocessor", preprocessor), ("classifier", LogisticRegression())]
)


scikit_model.fit(X_train, y_train)

explainer_sklearn = automl.MLExplainer(scikit_model,
                                    X_train,
                                    y_train,
                                    target_names=["<=50K", ">50K"],
                                    task="classification")

Change the number of iterations¶

One can modify the number of iterations n_iter used to evaluate the global importance of the model, or the local importance of a prediction.

Increasing n_iter requires a linear increase in computation time. It however provides more accurate importance estimates, thereby decreasing the variance in repeated calls to explain_model/explain_prediction.

The default value is auto, which selects a suitable default value based on the choice of the method of explanation.

In [38]:
result_explain_model_increase_n_iter = explainer_sklearn.explain_model(n_iter=20)
result_explain_model_increase_n_iter.show_in_notebook()

Include the effects of feature interactions (with Shapley feature importance)¶

The Oracle AutoMLx solution allows one to change the effect of feature interactions. This can be done through the tabulator_type argument of both global and local importance methods.

tabulator_type can be set to one of the following options:

  • permutation: This value is the default method in the MLExplainer object, with the behaviour described above

  • shapley: Feature importance is computed using the popular game-theoretic Shapley value method. Technically, this measures the importance of each feature while including the effect of all feature interactions. As a result, it runs in exponential time with respect to the number of features in the dataset. This method also includes the interaction effects of the other features, which means that if two features contain duplicate information, they will be less important. Note that the interpretation of this method's result is a bit different from the permutation method's result. An interested reader may find this a good source for learning more about it.

  • kernel_shap: Feature importance attributions will be calculated using an approximation of the Shapley value method. It typically provides relatively high-quality approximations; however, it currently does not provide confidence intervals.

  • shap_pi: Feature importance attributions will be computed using an approximation of the Shapley value method. It runs in linear time, but may miss the effect of interactions between some features, which may therefore produce lower-quality results. Most likely, you will notice that this method yields larger confidence intervals than the other two.

Summary: permutation can miss important features for AD. Exact SHAP (shapley) doesn't, but it is exponential. kernel_shap is an approximation of exact SHAP method that does not provide confidence intervals. shap_pi is linear, thus faster than exact SHAP and kernel_shap but unstable and very random leads to lower quality approximations.

Local feature importance with kernel_shap¶

In this example, we also enable sampling within the explainer to speed up the running time, because kernel SHAP is slower than permutation feature importance.

In [39]:
explainer_sklearn.configure_explain_prediction(tabulator_type="kernel_shap",
                                               sampling={'technique': 'random', 'n_samples': 2000})
In [40]:
index = 0
result_explain_prediction_kernel_shap = explainer_sklearn.explain_prediction(X_train.iloc[index:index+1,:])
result_explain_prediction_kernel_shap[0].show_in_notebook()

Explain the model or Explain the world¶

Oracle AutoMLx solution also provides the evaluator_type attribute, which allows one to choose whether to get feature importance attributions that explain exactly which features the model has learned to use (interventional), or which features the model or a retrained model could have learned to use (observational).

  • interventional : The computed feature importances are as faithful to the model as possible. That is, features that are ignored by the model will not be considered important. This setting should be preferred if the primary goal is to learn about the machine learning model itself. Technically, this setting is called 'interventional', because the method will intervene on the data distribution when assessing the importance of features. The intuition of feature importance attributions computed with this method is that the features are dropped from the dataset and the model is not allowed to retrain.

  • observational : The computed feature importances are more faithful to the relationships that exist in the real world (i.e., relationships observed in the dataset), even if your specific model did not learn to use them. For example, when using a permutation tabulator, a feature that is used by the model will not show a large impact on the model's performance if there is a second feature that contains near-duplicate information, because a re-trained model could have learned to use the other feature instead. (Similarly, for Shapley-based tabulators, a feature that is ignored by the model may have a non-zero feature importance if it could have been used by the model to predict the target.) This setting should be preferred if the model is merely a means to learn more about the relationships that exist within the data. Technically, this setting is called 'observational', because it observes the relationships in the data without breaking the existing data distribution.

Explain the model with observational evaluator_type¶
In [41]:
explainer_sklearn.configure_explain_model(evaluator_type="observational")
In [42]:
result_explain_model_observational = explainer_sklearn.explain_model()
result_explain_model_observational.show_in_notebook()

Local feature importance using surrogate models (LIME+)¶

The Oracle AutoMLx solution allows one to change the type of local explainer effect of feature interactions. This can be done through the explainer_type argument of local importance methods.

explainer_type can be set to one of the following options:

  • perturbation: This value is the default explainer type in local feature importance. As we showed above, the explanation(s) will be computed by perturbing the features of the indicated data instance(s) and measuring the impact on the model predictions.

  • surrogate: The LIME-style explanation(s) will be computed by fitting a surrogate model to the predictions of the original model in a small region around the indicated data instance(s) and measuring the importance of the features to the interpretable surrogate model. The method of surrogate explainer can be set to one of the following options:

    • systematic: An Oracle-labs-improved version of LIME that uses a systematic sampling and custom sample weighting. (Default)
    • lime: Local interpretable model-agnostic explanations (LIME) algorithm (https://arxiv.org/pdf/1602.04938).
In [43]:
explainer_sklearn.configure_explain_prediction(explainer_type='surrogate')
In [44]:
index = 0
result_explain_prediction_surrogate_lime = explainer_sklearn.explain_prediction(X_train.iloc[index:index+1, :])
result_explain_prediction_surrogate_lime[0].show_in_notebook()
Explain predictions with surrogate explainer_type (systematic)¶
In [45]:
explainer_sklearn.configure_explain_prediction(explainer_type='surrogate',
                                               method='systematic')
In [46]:
index = 0
result_explain_prediction_surrogate_systematic = explainer_sklearn.explain_prediction(X_train.iloc[index:index + 1, :])
result_explain_prediction_surrogate_systematic[0].show_in_notebook()

Advanced Feature Dependence Options (ALE)¶

We now show how to use an alternative method for computing feature dependence: accumulated local effects (ALE). ALE explanations are sometimes considered a better alternative to PDPs when features are correlated, because it does not evaluate the model outside of its training distribution in these cases. For more information, see https://christophm.github.io/interpretable-ml-book/ale.html.

Given a dataset, an ALE displays the average change in the output of the model, accumulated of multiple small changes in one or two features, when all other features are held fixed. By default, the ALE explanations are centered around 0, and thus, unlike PDPs, ALEs show the change in the prediction measured by changing a given feature, rather than the average model's prediction for a particular feature value.

One-feature ALEs¶

It can be computed for a single feature, as in the cell below. The X-axis is the value of the education-num feature and the y-axis is the corresponding computed ALE (price unit).

The histogram on top shows the distribution of the value of the education-num feature in the dataset.

In [47]:
explainer_sklearn.configure_explain_feature_dependence(explanation_type='ale')
result_explain_feature_dependence_ale = explainer_sklearn.explain_feature_dependence('education-num')
result_explain_feature_dependence_ale.show_in_notebook()

Two-feature ALEs¶

We can also compute ALEs for two features (provided that at least one of them is numerical). The plot below is the ALE plot for the education-num and sex features. The X-axis still shows the values of education-num however, there are now multiple lines, one for each value of the feature sex.

The histogram displays the joint distribution of the two features.

In [48]:
result_explain_feature_dependence_default = explainer_sklearn.explain_feature_dependence(['education-num', 'sex'])
result_explain_feature_dependence_default.show_in_notebook()

References¶

  • More examples and details: http://automl.oraclecorp.com/
  • Oracle AutoML http://www.vldb.org/pvldb/vol13/p3166-yakovlev.pdf
  • scikit-learn https://scikit-learn.org/stable/
  • Interpretable Machine Learning https://christophm.github.io/interpretable-ml-book/
  • LIME https://arxiv.org/pdf/1602.04938
  • UCI https://archive.ics.uci.edu/ml/datasets/Adult